Introduce the following notation:
Computation at layer $l$ of a neural network: $\zv_l = \Wv_l \av_{l-1} + \bv_l$, $\av_l = h_l (\zv_l)$
The whole network: $\xv = \av_0 \xrightarrow{\Wv_1,\bv_1} \zv_1 \xrightarrow{h_1} \av_1 \xrightarrow{\Wv_2,\bv_2} \cdots \xrightarrow{\Wv_L,\bv_L} \zv_L \xrightarrow{h_L} \av_L = \hat{\yv}$
The earliest M-P model used the step function $\sgn(\cdot)$ as its activation function
Directions for improvement:
Common choices include
Squashes $\Rbb$ into $[0,1]$, so the output can be interpreted as a probability:
$$ \begin{align*} \quad \sigma(z) = \frac{1}{1 + \exp (-z)} = \begin{cases} 1, & z \rightarrow \infty \\ 0, & z \rightarrow -\infty \end{cases} \end{align*} $$
The logistic function is continuously differentiable, and its derivative is largest at zero
$$ \begin{align*} \quad \nabla \sigma(z) = \sigma(z) (1 - \sigma(z)) \le \left( \frac{\sigma(z) + 1 - \sigma(z)}{2} \right)^2 = \frac{1}{4} \end{align*} $$
Equality in the AM-GM inequality holds when $\sigma(z) = 1 - \sigma(z)$, i.e., $z = 0$
Squashes $\Rbb$ into $[-1,1]$; the output is zero-centered; a scaled and shifted version of the logistic function
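The bound above can be checked numerically. A minimal sketch (the function names `sigmoid` and `sigmoid_grad` are illustrative, not from the text):

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-10, 10, 2001)
g = sigmoid_grad(z)
# the maximum derivative is 1/4, attained at z = 0
print(g.max(), z[g.argmax()])
```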
$$ \begin{align*} \quad \tanh(z) & = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)} = \frac{1 - \exp(-2z)}{1 + \exp(-2z)} = 2 \sigma(2z) - 1 \\[2pt] & = \begin{cases} 1, & z \rightarrow \infty \\ -1, & z \rightarrow -\infty \end{cases} \\[10pt] \nabla \tanh(z) & = 4 \sigma(2z) (1 - \sigma(2z)) \le 1 \end{align*} $$
The hyperbolic tangent is continuously differentiable, and its derivative is largest at $z = 0$
Zero-centered outputs keep the inputs of all non-input layers near zero, where the hyperbolic tangent's derivative is largest, so gradient-descent updates are more efficient; the logistic function's output is always positive, which slows the convergence of gradient descent
Rectified linear unit (ReLU):
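The identity $\tanh(z) = 2\sigma(2z) - 1$ and the derivative bound can be verified directly. A small sketch (helper name `sigmoid` is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 1001)
# identity from the text: tanh(z) = 2*sigma(2z) - 1
assert np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)

# derivative 4*sigma(2z)*(1 - sigma(2z)) peaks at 1 when z = 0
grad = 4 * sigmoid(2 * z) * (1 - sigmoid(2 * z))
print(grad.max())
```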
$$ \begin{align*} \quad \relu(z) = \max \{ 0, z \} = \begin{cases} z & z \ge 0 \\ 0 & z < 0 \end{cases} \end{align*} $$
Advantages
Disadvantages
By the chain rule,
$$ \begin{align*} \quad \nabla_{\wv} \relu(\wv^\top \xv + b) & = \frac{\partial \relu(\wv^\top \xv + b)}{\partial (\wv^\top \xv + b)} \frac{\partial (\wv^\top \xv + b)}{\partial \wv} \\ & = \frac{\partial \max \{ 0, \wv^\top \xv + b \}}{\partial (\wv^\top \xv + b)} \xv \\ & = \Ibb(\wv^\top \xv + b \ge 0) \xv \end{align*} $$
If the $(\wv,b)$ of some neuron in the first hidden layer is initialized badly, so that $\wv^\top \xv + b < 0$ for every $\xv$, then its gradient with respect to $(\wv,b)$ is zero and the neuron will never be updated in subsequent training (a "dying ReLU")
Solutions: leaky ReLU, parametric ReLU, ELU, Softplus
Leaky ReLU: retains a nonzero gradient even when $\wv^\top \xv + b < 0$
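The dying-ReLU effect can be reproduced with the gradient formula above. A sketch, assuming standard-normal inputs and a deliberately bad bias (both illustrative choices):

```python
import numpy as np

def relu_grad_w(w, b, X):
    """Gradient of relu(w^T x + b) w.r.t. w for each row x of X:
    indicator(w^T x + b >= 0) * x, as derived in the text."""
    pre = X @ w + b
    return (pre >= 0).astype(float)[:, None] * X

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
w = rng.standard_normal(3)

# a badly initialized neuron: bias so negative that w^T x + b < 0 for all x
b_dead = -1e3
grads = relu_grad_w(w, b_dead, X)
print(np.abs(grads).max())  # 0.0 -- the neuron can never update
```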
$$ \begin{align*} \quad \lrelu(z) & = \begin{cases} z & z \ge 0 \\ \gamma z & z < 0 \end{cases} \\ & = \max \{ 0, z \} + \gamma \min \{ 0, z \} \overset{\gamma < 1}{=} \max \{ z, \gamma z \} \end{align*} $$
where the slope $\gamma$ is a small constant, e.g., $0.01$
Parametric ReLU: the slope $\gamma_i$ is learnable
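The three equivalent forms of the leaky ReLU given above can be confirmed numerically (a quick sketch):

```python
import numpy as np

gamma = 0.01
z = np.linspace(-5, 5, 1001)

piecewise = np.where(z >= 0, z, gamma * z)
form1 = np.maximum(0, z) + gamma * np.minimum(0, z)
form2 = np.maximum(z, gamma * z)   # valid only because gamma < 1

assert np.allclose(piecewise, form1)
assert np.allclose(form1, form2)
```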
$$ \begin{align*} \quad \prelu(z) & = \begin{cases} z & z \ge 0 \\ \gamma_i z & z < 0 \end{cases} \\[4pt] & = \max \{ 0, z \} + \gamma_i \min \{ 0, z \} \end{align*} $$
Each neuron can have its own parameter, or a group of neurons can share one
Exponential linear unit (ELU)
$$ \begin{align*} \quad \elu(z) & = \begin{cases} z & z \ge 0 \\ \gamma (\exp(z) - 1) & z < 0 \end{cases} \\[4pt] & = \max \{ 0, z \} + \min \{ 0, \gamma (\exp(z) - 1) \} \end{align*} $$
The Softplus function can be viewed as a smooth version of ReLU:
$$ \begin{align*} \quad \softplus(z) = \ln (1 + \exp(z)) \end{align*} $$
Its derivative is the logistic function
$$ \begin{align*} \quad \nabla \softplus(z) = \frac{\exp(z)}{1 + \exp(z)} = \frac{1}{1 + \exp(-z)} \end{align*} $$
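That the derivative of Softplus is the logistic function can be checked with a central finite difference (a minimal sketch; the helper names are illustrative):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 801)
eps = 1e-6
# central difference approximation of d/dz softplus(z)
fd = (softplus(z + eps) - softplus(z - eps)) / (2 * eps)
assert np.allclose(fd, sigmoid(z), atol=1e-5)
```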
The Swish function is a self-gated (self-gated) activation function:
$$ \begin{align*} \quad \swish(z) = z \cdot \sigma (\beta z) = \frac{z}{1 + \exp(-\beta z)} \end{align*} $$
where $\beta$ is a learnable parameter
Consider layer $l$ of the neural network:
$$ \begin{align*} \quad \zv_l & = \Wv_l \av_{l-1} + \bv_l \\ \av_l & = h_l (\zv_l) \end{align*} $$
All of the activation functions above map $\Rbb \mapsto \Rbb$, i.e., $[\av_l]_i = h_l ([\zv_l]_i), ~ i \in [n_l]$
A maxout unit maps $\Rbb^{n_l} \mapsto \Rbb$; its input is the whole vector $\zv_l$, and it is defined as
$$ \begin{align*} \quad \maxout (\zv) = \max_{k \in [K]} \{ \wv_k^\top \zv + b_k \} \end{align*} $$
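A maxout unit is just a maximum over $K$ affine pieces. A small sketch (shapes and the ReLU-recovery check are illustrative assumptions):

```python
import numpy as np

def maxout(z, W, b):
    """Maxout unit: max over K affine pieces w_k^T z + b_k.
    W has shape (K, n_l), b has shape (K,)."""
    return np.max(W @ z + b)

rng = np.random.default_rng(1)
K, n_l = 4, 5
W = rng.standard_normal((K, n_l))
b = rng.standard_normal(K)
z = rng.standard_normal(n_l)
print(maxout(z, W, b))

# with K = 2, pieces z and 0, maxout reduces to ReLU
assert maxout(np.array([-3.0]), np.array([[1.0], [0.0]]), np.zeros(2)) == 0.0
assert maxout(np.array([2.0]), np.array([[1.0], [0.0]]), np.zeros(2)) == 2.0
```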
The first $L-1$ layers form a composite function $\psi: \Rbb^d \mapsto \Rbb^{n_{L-1}}$, which can be viewed as a feature transformation
The last layer is a learner $\hat{\yv} = g(\psi(\xv); \Wv_L, \bv_L)$ that makes predictions on the input
Logistic regression can also be viewed as a neural network with a single layer (no hidden layers)
Traditional machine learning: feature engineering and model learning are two separate stages
Deep learning: feature engineering and model learning are merged into one, end-to-end (end-to-end)
The whole network: $\xv = \av_0 \xrightarrow{\Wv_1,\bv_1} \zv_1 \xrightarrow{h_1} \av_1 \xrightarrow{\Wv_2,\bv_2} \cdots \xrightarrow{\Wv_L,\bv_L} \zv_L \xrightarrow{h_L} \av_L = \hat{\yv}$
The optimization objective of a neural network is
$$ \begin{align*} \quad \min_{\Wv, \bv} ~ \frac{1}{m} \sum_{i \in [m]} \ell (\yv_i, \hat{\yv}_i) \end{align*} $$
where computing the loss $\ell (\yv, \hat{\yv})$ constitutes the forward pass
The gradient-descent update rule is
$$ \begin{align*} \quad \Wv ~ \leftarrow ~ \Wv - \frac{\eta}{m} \sum_{i \in [m]} \class{yellow}{\frac{\partial \ell (\yv_i, \hat{\yv}_i)}{\partial \Wv}}, \quad \bv ~ \leftarrow ~ \bv - \frac{\eta}{m} \sum_{i \in [m]} \class{yellow}{\frac{\partial \ell (\yv_i, \hat{\yv}_i)}{\partial \bv}} \end{align*} $$
The whole network: $\xv = \av_0 \xrightarrow{\Wv_1,\bv_1} \zv_1 \xrightarrow{h_1} \av_1 \xrightarrow{\Wv_2,\bv_2} \cdots \xrightarrow{\Wv_L,\bv_L} \zv_L \xrightarrow{h_L} \av_L = \hat{\yv}$
For the last layer, $\zv_L = \Wv_L \av_{L-1} + \bv_L$, $\av_L = h_L (\zv_L)$; by the chain rule,
$$ \begin{align*} \quad \frac{\partial \ell (\yv, \hat{\yv})}{\partial \bv_L} & = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_L} \frac{\partial \zv_L}{\partial \bv_L} = \deltav_L^\top \frac{\partial \zv_L}{\partial \bv_L} = \deltav_L^\top \\ \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_L} & = \sum_{j \in [n_L]} \frac{\partial \ell (\yv, \hat{\yv})}{\partial [\zv_L]_j} \frac{\partial [\zv_L]_j}{\partial \Wv_L} = \sum_{j \in [n_L]} [\deltav_L]_j \frac{\partial [\zv_L]_j}{\partial \Wv_L} \end{align*} $$
where $\deltav_L^\top = \partial \ell (\yv, \hat{\yv}) / \partial \zv_L \in \Rbb^{n_L}$ is the error term of layer $L$, which can be computed directly
Similarly, for layer $l$ with $\zv_l = \Wv_l \av_{l-1} + \bv_l$, $\av_l = h_l (\zv_l)$, the chain rule gives
$$ \begin{align*} \quad \frac{\partial \ell (\yv, \hat{\yv})}{\partial \bv_l} = \deltav_l^\top, \quad \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_l} = \sum_{j \in [n_l]} [\deltav_l]_j \frac{\partial [\zv_l]_j}{\partial \Wv_l} \end{align*} $$
where $\deltav_l^\top = \partial \ell (\yv, \hat{\yv}) / \partial \zv_l \in \Rbb^{n_l}$ is the error term of layer $l$
Backpropagation (backpropagation, BP): the error of an earlier layer is obtained from the layer after it
$$ \begin{align*} \quad \deltav_{l-1}^\top = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_{l-1}} = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_l} \frac{\partial \zv_l}{\partial \av_{l-1}} \frac{\partial \av_{l-1}}{\partial \zv_{l-1}} = \deltav_l^\top \Wv_l \frac{\partial h_{l-1}(\zv_{l-1})}{\partial \zv_{l-1}} \end{align*} $$
Finally, for layer $l$ with $\zv_l = \Wv_l \av_{l-1} + \bv_l$, how do we compute $\partial [\zv_l]_j / \partial \Wv_l$?
Note that $[\zv_l]_j = \sum_k [\Wv_l]_{jk} [\av_{l-1}]_k + [\bv_l]_j$ depends only on the $j$-th row of $\Wv_l$, so
$$ \begin{align*} \quad & \frac{\partial [\zv_l]_j}{\partial \Wv_l} = \underbrace{\begin{bmatrix} \zerov, \ldots, \av_{l-1}, \ldots, \zerov \end{bmatrix}}_{\text{only }\av_{l-1}\text{ at }j\text{-th column}} = \av_{l-1} \ev_j^\top \\[4pt] \quad & \Longrightarrow \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_l} = \sum_{j \in [n_l]} [\deltav_l]_j \frac{\partial [\zv_l]_j}{\partial \Wv_l} = \av_{l-1} \sum_{j \in [n_l]} [\deltav_l]_j \ev_j^\top = \av_{l-1} \deltav_l^\top \end{align*} $$
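The recursions above can be exercised on a tiny network and validated against a finite-difference gradient. A sketch, assuming a squared loss, a tanh hidden layer, and an identity output activation (illustrative choices, not fixed by the text); note the code stores $\partial \ell / \partial \Wv_l$ in the same shape as $\Wv_l$, i.e., the transpose of the text's $\av_{l-1} \deltav_l^\top$:

```python
import numpy as np

rng = np.random.default_rng(0)

# tiny 2-layer network: z1 = W1 x + b1, a1 = tanh(z1), z2 = W2 a1 + b2
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)
y = rng.standard_normal(2)

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    a1 = np.tanh(z1)
    z2 = W2 @ a1 + b2                    # identity output activation
    loss = 0.5 * np.sum((z2 - y) ** 2)   # squared loss (assumption)
    return z1, a1, z2, loss

z1, a1, z2, loss = forward(W1, b1, W2, b2)

# backward pass, following the text's recursions
delta2 = z2 - y                                      # dL/dz_L, computed directly
dW2 = np.outer(delta2, a1)                           # delta_L a_{L-1}^T (transposed layout)
delta1 = (W2.T @ delta2) * (1 - np.tanh(z1) ** 2)    # delta_{l-1} from delta_l
dW1 = np.outer(delta1, x)

# finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
fd = (forward(W1p, b1, W2, b2)[3] - forward(W1m, b1, W2, b2)[3]) / (2 * eps)
assert abs(fd - dW1[0, 0]) < 1e-5
```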
Input: training set, validation set, and the relevant hyperparameters
Output: $\Wv$ and $\bv$
import numpy as np
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(h,),      # number of hidden-layer neurons
    activation='logistic',        # identity, logistic, tanh, relu
    max_iter=100,                 # maximum number of iterations
    solver='lbfgs',               # solver
    alpha=0,                      # regularization coefficient
    batch_size=32,                # batch size
    learning_rate='constant',     # constant, invscaling, adaptive
    shuffle=True,                 # reshuffle samples each epoch
    momentum=0.9,                 # momentum coefficient, sgd only
    nesterovs_momentum=True,      # Nesterov acceleration for momentum
    early_stopping=False,         # whether to stop early
    warm_start=False,             # whether to enable warm start
    random_state=1,
    verbose=False,
    # ...
)
clf = mlp.fit(X, y)
acc = clf.score(X, y)
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(units=3, activation="sigmoid", input_shape=(2, )))
model.add(Dense(units=1, activation='sigmoid'))
model.summary()  # print the model
# _________________________________________________________________
#  Layer (type)                Output Shape              Param #
# =================================================================
#  dense (Dense)               (None, 3)                 9
#  dense_1 (Dense)             (None, 1)                 4
# =================================================================
# Total params: 13
# Trainable params: 13
# Non-trainable params: 0
# _________________________________________________________________

model.compile(
    optimizer=Adam(0.1),
    loss="binary_crossentropy",
    metrics=['accuracy']
)
model.fit(X, y, epochs=10, batch_size=32)
# Epoch 1/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.6481 - accuracy: 0.6309
# Epoch 2/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.5064 - accuracy: 0.7500
# Epoch 3/10
# 32/32 [==============] - 0s 1000us/step - loss: 0.3309 - accuracy: 0.8369
# Epoch 4/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.1383 - accuracy: 1.0000
# Epoch 5/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.0643 - accuracy: 1.0000
# Epoch 6/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.0395 - accuracy: 1.0000
# Epoch 7/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.0276 - accuracy: 1.0000
# Epoch 8/10
# 32/32 [==============] - 0s 1ms/step - loss: 0.0208 - accuracy: 1.0000
# Epoch 9/10
# 32/32 [==============] - 0s 994us/step - loss: 0.0165 - accuracy: 1.0000
# Epoch 10/10
# 32/32 [==============] - 0s 997us/step - loss: 0.0134 - accuracy: 1.0000

loss, acc = model.evaluate(X, y, verbose=2)
# 32/32 - 0s - loss: 0.0121 - accuracy: 1.0000 - 93ms/epoch - 3ms/step
The iterative formula for error backpropagation in a neural network is
$$ \begin{align*} \quad \deltav_l^\top = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_l} = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_{l+1}} \frac{\partial \zv_{l+1}}{\partial \av_l} \frac{\partial \av_l}{\partial \zv_l} = \deltav_{l+1}^\top \Wv_{l+1} \diag (h_l'(\zv_l)) \end{align*} $$
For Sigmoid-type activation functions,
the error is multiplied by a factor no greater than $1$ each time it propagates back one layer; when the network is very deep, the gradient decays repeatedly and may vanish, making the whole network hard to train
Solution: use activation functions with larger derivatives, such as ReLU
Residual module: $\zv_l = \av_{l-1} + \class{yellow}{\Uv_2 \cdot h(\Uv_1 \cdot \av_{l-1} + \cv_1) + \cv_2} = \av_{l-1} + \class{yellow}{f(\av_{l-1})}$
Assume $\av_l = \zv_l$, i.e., no activation function is applied to the residual module's output; then for any $t \in [l]$,
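The decay can be made concrete: for the logistic function each backward step contributes a factor $h'(z) \le 1/4$, and the product of such factors shrinks geometrically with depth. A sketch with random preactivations (an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# activation-derivative factor at each of 30 layers (random preactivations)
z = rng.standard_normal(30)
factors = sigmoid_grad(z)        # each at most 1/4 for the logistic function
prods = np.cumprod(factors)
print(prods[4], prods[14], prods[29])   # shrinks geometrically with depth

# ReLU's derivative is exactly 1 wherever the unit is active,
# so the same product stays at 1 through every active layer
assert np.all(factors <= 0.25)
```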
$$ \begin{align*} \quad \av_l = \av_{l-1} + f(\av_{l-1}) = \av_{l-2} + f(\av_{l-2}) + f(\av_{l-1}) = \cdots = \av_{l-t} + \sum_{i=l-t}^{l-1} f(\av_i) \end{align*} $$
Low-layer inputs can propagate unchanged (as an identity mapping) to any higher layer
Low-layer inputs can propagate unchanged (as an identity mapping) to any higher layer
$$ \begin{align*} \quad \av_l = \av_{l-t} + \sum_{i=l-t}^{l-1} f(\av_i) \end{align*} $$
By the chain rule,
$$ \begin{align*} \quad \frac{\partial \ell}{\partial \av_{l-t}} & = \frac{\partial \ell}{\partial \av_l} \frac{\partial \av_l}{\partial \av_{l-t}} = \frac{\partial \ell}{\partial \av_l} \left( \frac{\partial \av_{l-t}}{\partial \av_{l-t}} + \frac{\partial }{\partial \av_{l-t}} \sum_{i=l-t}^{l-1} f(\av_i) \right) \\ & = \frac{\partial \ell}{\partial \av_l} \left( \Iv + \frac{\partial }{\partial \av_{l-t}} \sum_{i=l-t}^{l-1} f(\av_i) \right) \\ & = \frac{\partial \ell}{\partial \av_l} + \frac{\partial \ell}{\partial \av_l} \left( \frac{\partial }{\partial \av_{l-t}} \sum_{i=l-t}^{l-1} f(\av_i) \right) \end{align*} $$
High-layer errors can propagate unchanged to any lower layer, which mitigates gradient vanishing
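The $\Iv + \partial f / \partial \av$ structure can be simulated numerically: even when each residual branch has a tiny Jacobian, the accumulated Jacobian stays near the identity instead of vanishing. A sketch with random small Jacobians (sizes and scales are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 10

# Jacobian of a_l w.r.t. a_{l-T} through T residual blocks a <- a + f(a):
# each block contributes I + J_f
J = np.eye(n)
for _ in range(T):
    Jf = 0.01 * rng.standard_normal((n, n))   # small residual-branch Jacobian
    J = (np.eye(n) + Jf) @ J
print(np.linalg.norm(J - np.eye(n)))          # stays close to the identity

# a plain chain a <- f(a) with equally tiny Jacobians vanishes
J_plain = np.eye(n)
for _ in range(T):
    J_plain = (0.01 * rng.standard_normal((n, n))) @ J_plain
print(np.linalg.norm(J_plain))                # essentially zero
```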
Neural networks have been extended to many types of data
The image dataset ImageNet:
Training a fully connected network on ImageNet
Total number of parameters: (50,176 + 1,000) × 10,000 = 511,760,000
Local connectivity: each neuron connects only to a fixed number (far fewer than all) of neurons in the previous layer
Weight sharing: a fixed set of neurons all use the same input weights
Limiting the number of input weights per neuron reduces the parameter count and hence the model complexity
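The count above follows from the layer sizes (weights only, no biases). A quick arithmetic check, assuming a 224×224 single-channel input, one hidden layer of 10,000 units, and 1,000 outputs:

```python
# fully connected network on 224x224 inputs (assumed layer sizes)
n_input = 224 * 224          # 50,176 input units
n_hidden = 10_000
n_output = 1_000
# weights into and out of the hidden layer
params = (n_input + n_output) * n_hidden
print(params)  # 511760000
```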
$$ \begin{align*} \qquad \qquad \qquad \qquad a_1 & = x_1 \times w_1 + x_2 \times w_2 + x_3 \times w_3 \\ a_2 & = x_2 \times w_1 + x_3 \times w_2 + x_4 \times w_3 \\ a_3 & = x_3 \times w_1 + x_4 \times w_2 + x_5 \times w_3 \\ a_4 & = x_4 \times w_1 + x_5 \times w_2 + x_6 \times w_3 \end{align*} $$
$$ \begin{align*} \quad (f \otimes g) [n] = \sum_{m = -\infty}^\infty f[m] \cdot g[n-m] \end{align*} $$
Take $f[i] = x_i$, $g[-2] = w_3$, $g[-1] = w_2$, $g[0] = w_1$, and zero elsewhere
$$ \begin{align*} \quad a_n = x_n w_1 + x_{n+1} w_2 + x_{n+2} w_3 = \sum_{m = -\infty}^\infty f[m] \cdot g[n-m] = (f \otimes g) [n] \end{align*} $$
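The weight-sharing equations for $a_1, \ldots, a_4$ are exactly a sliding window. A sketch with made-up values for $x$ and $w$ (the specific numbers are assumptions):

```python
import numpy as np

x = np.arange(1.0, 7.0)          # x_1..x_6 (example values)
w = np.array([2.0, 3.0, 5.0])    # w_1, w_2, w_3 (example values)

# weight sharing: a_n = x_n w_1 + x_{n+1} w_2 + x_{n+2} w_3
a_manual = np.array([x[i] * w[0] + x[i + 1] * w[1] + x[i + 2] * w[2]
                     for i in range(4)])

# the same computation as a sliding window (cross-correlation)
a_conv = np.correlate(x, w, mode='valid')
assert np.allclose(a_manual, a_conv)
```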
The case where the input is a matrix
The dark region involved in the convolution is called the receptive field (receptive field) of the corresponding output neuron
Smoothing and denoising
$\otimes ~ \begin{bmatrix} \frac{1}{9} & \frac{1}{9} & \frac{1}{9} \\ \frac{1}{9} & \frac{1}{9} & \frac{1}{9} \\ \frac{1}{9} & \frac{1}{9} & \frac{1}{9} \end{bmatrix} ~ =$
Edge detection
$\otimes ~ \begin{bmatrix} 0 & 1 & 1 \\ -1 & 0 & 1 \\ -1 & -1 & 0 \end{bmatrix} ~ = $
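Both filters above can be applied with a plain 2-D sliding window. A minimal sketch on a synthetic image with a vertical step edge (the image itself is an assumption):

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Plain 2-D cross-correlation with 'valid' padding (minimal sketch)."""
    kh, kw = kernel.shape
    out = np.empty((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = np.zeros((5, 5))
img[:, 2:] = 1.0                  # a vertical step edge

smooth = conv2d_valid(img, np.full((3, 3), 1.0 / 9))   # mean filter from the text
edge = conv2d_valid(img, np.array([[0., 1., 1.],
                                   [-1., 0., 1.],
                                   [-1., -1., 0.]]))   # edge filter from the text
print(smooth)
print(edge)   # responds at the edge, zero on flat regions (kernel sums to 0)
```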
The pooling (pooling) layer is also called the subsampling (subsampling) layer
It downsamples a region to a single value, reducing network parameters and model complexity
A convolutional neural network is built by alternately stacking convolutional, pooling, and fully connected layers
Trends
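Pooling can be sketched in a few lines with non-overlapping 2×2 windows (window size and the example matrix are assumptions):

```python
import numpy as np

def pool2x2(x, mode="max"):
    """Non-overlapping 2x2 pooling (minimal sketch; assumes even dimensions)."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))     # average pooling

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 4., 1., 1.]])
print(pool2x2(x, "max"))    # [[4., 8.], [4., 1.]]
print(pool2x2(x, "mean"))   # [[2.5, 6.5], [1., 1.]]
```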
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.datasets.mnist import load_data
from tensorflow.keras.layers import (Conv2D, Dense, Dropout, Flatten,
                                     AveragePooling2D)
from tensorflow.keras.optimizers import Adam

(x_train, y_train), (x_test, y_test) = load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

model = Sequential()
model.add(Conv2D(6, (5, 5), activation="relu", padding="same",
                 input_shape=(28, 28, 1)))
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Conv2D(16, (5, 5), activation="relu"))
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Conv2D(120, (5, 5), activation="relu"))
model.add(Flatten())
model.add(Dense(84, activation="relu"))
model.add(Dense(10, activation="softmax"))
model.summary()
# Model: "sequential"
# _________________________________________________________________
#  Layer (type)                Output Shape              Param #
# =================================================================
#  conv2d (Conv2D)             (None, 28, 28, 6)         156
#
#  average_pooling2d (AverageP (None, 14, 14, 6)         0
#  ooling2D)
#
#  conv2d_1 (Conv2D)           (None, 10, 10, 16)        2416
#
#  average_pooling2d_1 (Averag (None, 5, 5, 16)          0
#  ePooling2D)
#
#  conv2d_2 (Conv2D)           (None, 1, 1, 120)         48120
#
#  flatten (Flatten)           (None, 120)               0
#
#  dense (Dense)               (None, 84)                10164
#
#  dense_1 (Dense)             (None, 10)                850
#
# =================================================================
# Total params: 61,706
# Trainable params: 61,706
# Non-trainable params: 0

model.compile(
    optimizer=Adam(0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)
model.fit(x_train, y_train, epochs=5, verbose=1)
# Epoch 1/5
# 1875/1875 [==============================] - 7s 2ms/step - loss: 0.2085 - accuracy: 0.9356
# Epoch 2/5
# 1875/1875 [==============================] - 4s 2ms/step - loss: 0.0657 - accuracy: 0.9795
# Epoch 3/5
# 1875/1875 [==============================] - 4s 2ms/step - loss: 0.0479 - accuracy: 0.9844
# Epoch 4/5
# 1875/1875 [==============================] - 4s 2ms/step - loss: 0.0372 - accuracy: 0.9886
# Epoch 5/5
# 1875/1875 [==============================] - 4s 2ms/step - loss: 0.0300 - accuracy: 0.9907

model.evaluate(x_test, y_test, verbose=2)
# 313/313 - 1s - loss: 0.0409 - accuracy: 0.9860 - 536ms/epoch - 2ms/step
Image classification with ResNet50, a residual network pretrained on ImageNet
import numpy as np
from tensorflow.keras.applications import resnet50
from tensorflow.keras.preprocessing import image

model = resnet50.ResNet50(weights='imagenet')

img = image.load_img('../img/tj/tj224x224.jpg', target_size=(224, 224))
# add the channel dimension, RGB: (224, 224, 3), grayscale: (224, 224, 1)
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)     # batch_size = 1
x = resnet50.preprocess_input(x)  # centering

preds = model.predict(x)
print(resnet50.decode_predictions(preds, top=5)[0])
# ('n03630383', 'lab_coat', 0.24623604)
# ('n03877472', 'pajama', 0.17045474)
# ('n04317175', 'stethoscope', 0.095500074)
# ('n04479046', 'trench_coat', 0.07988542)
# ('n03617480', 'kimono', 0.055965725)
The self-cultivation of an excellent "alchemist" (deep-learning practitioner):